Efficient Implementation of Equation-of-Motion Coupled-Cluster Singles and Doubles Method with the Density-Fitting Approximation: An Enhanced Algorithm for the Particle–Particle Ladder Term

An efficient implementation of the density-fitted equation-of-motion coupled-cluster singles and doubles (DF-EOM-CCSD) method is presented with an enhanced algorithm for the particle–particle ladder (PPL) term, which is the most expensive part of EOM-CCSD computations. To further improve the evaluation of the PPL term, a hybrid density-fitting/Cholesky decomposition (DF/CD) algorithm is also introduced. In the hybrid DF/CD approach, four virtual index integrals are constructed on-the-fly from the DF factors; then, their partial Cholesky decomposition is simultaneously performed. The computational cost of the DF-EOM-CCSD method for excitation energies is compared with that of the resolution of the identity EOM-CCSD (RI-EOM-CCSD) (from the Q-chem 5.3 package). Our results demonstrate that DF-EOM-CCSD excitation energies are significantly accelerated compared to RI-EOM-CCSD. There is more than a 2-fold reduction for the C8H18 molecule in the cc-pVTZ basis set with the restricted Hartree-Fock (RHF) reference. This cost savings results from the efficient evaluation of the PPL term. In the RHF based DF-EOM-CCSD method, the number of flops (NOF) is 1/4O2V4, while that of RI-EOM-CCSD was reported (Epifanovsky et al. J. Chem. Phys.2013, 139, 13410524116550) to be 5/8O2V4 for the PPL contraction term. Further, the NOF of VVVV-type integral transformation is 1/2V4Naux in our case, while it appears to be V4Naux for RI-EOM-CCSD. Hence, our implementation is 2.5 and 2.0 times more efficient compared to RI-EOM-CCSD for these expensive terms. For the unrestricted Hartree-Fock (UHF) reference, our implementation maintains its enhanced performance and provides a 1.8-fold reduction in the computational time compared to RI-EOM-CCSD for the C7H16 molecule. Our results indicate that our DF-EOM-CCSD implementation is 1.7 and 1.4 times more efficient compared with RI-EOM-CCSD for average computational cost per EOM-CCSD iteration. Moreover, our results show that the new hybrid DF/CD approach improves upon the DF algorithm, especially for large molecular systems. Overall, we conclude that the new hybrid DF/CD PPL algorithm is very promising for large-sized chemical systems.


INTRODUCTION
It is well-known that coupled-cluster (CC) methods provide accurate results for molecular properties for most chemical systems near equilibrium geometries. 1−13 For example, the coupled-cluster singles and doubles (CCSD) method 14 provides quite accurate results for most molecular systems at equilibrium geometries. The addition of a perturbative triples excitations correction [CCSD(T)] 10,11,15 further enhances CCSD and yields very accurate results for a broad range of molecular systems. 12,16−25 However, high computational costs of common CC methods, such as O(N 6 ) and O(N 7 ) for CCSD and CCSD(T) (where N is the number of basis functions), limits their applications to relatively small-sized chemical systems.
Accurate computations of excitation energies (EEs) is one of the most challenging problems in modern quantum chemistry. Equation-of-motion CC (EOM-CC) methods provide accurate results for excited-state properties for a broad range of chemical systems. 26−44 The accuracy of the EOM approach based on the CCSD model (EOM-CCSD) has been reported to be 0.1−0.2 eV. 28,31 However, as in the case of the ground-state CC methods, the computational cost and disk/memory requirements for the EOM-CC methods scale steeply with the system size.
Tensor decomposition techniques for electron repulsion integrals (ERIs) have been of significant interest in modern computational chemistry. 45−70 Density fitting (DF) is one of the most popular ERI decomposition techniques. 45−52,59−70 In the DF approach, the ERI tensor of rank-4 is expanded in terms of rank-3 tensors. Another common ERI factorization approach is the partial Cholesky decomposition (CD). [55][56][57][58][59][60]63,64 The DF and CD techniques are very useful to reduce the cost of integral transformations and the storage requirements for the ERI tensor.
In this research, a new implementation of the density-fitted EOM-CCSD method is presented with an enhanced algorithm for the particle−particle ladder (PPL) term, which is the most expensive term. The equations presented have been implemented in a new computer code by the present authors and added to the MACROQC package. 71 The computational time of our DF-EOM-CCSD implementation is compared with that of the Q-CHEM 5.3 software. 72 The DF-EOM-CCSD method is applied to a test set for excitation energies.

CCSD ENERGY AND AMPLITUDE EQUATIONS
At first, we would like to note that all equations reported in this study are in the spin−orbital formalism. The spin-free version of our equations for the restricted closed-shell systems are provided in the Supporting Information. The unrestricted version of the formulas can be readily obtained from the spin− orbital equations.
The correlation energy for the CCSD method can be expressed as follows where â † and îare the creation and annihilation operators and t i a and t ij ab are the single and double excitation amplitudes, respectively.
t i a and t ij ab can be obtained from the following equations: where Φ i a and Φ ij ab are singly and doubly excited Slater determinants, respectively. For explicit equations of our CCSD implementation, one may refer to our previous studies. 65−67

THE EOM-CCSD MODEL
In the EOM-CCSD framework, the target excited-state wave functions are written as follows: where R̂and L̂are linear excitation and de-excitation operators, respectively. For EOM-CCSD, R̂= R̂1 + R̂2: where r i a and r ij ab are the single and double excitation amplitudes, respectively, and the notation {â †...i} denotes a string of normalordered operators with respect to the Fermi vacuum.
For the ground state, we have the following Schrodinger equation: Hence, by multiplying eq 10 by e −T̂, we obtain where H̅ = e −T̂ĤeT̂. Further, the normal ordered H̅ can be written as follows: Hence, we define: where subscript c means that only connected diagrams should be included. Therefore, we may rewrite eq 11 as follows where ΔE is the ground-state CC correlation energy. The excited-state eigenvalue equation can be written as follows: where ΔE R is the excited-state CC correlation energy. The excitation energy can be written as After performing some algebra, we obtain the EOM-CCSD equation as follows: Equation 17 is equivalent to the following matrix eigenvalue equation for CCSD: i k j j j j j j j j j j j j y { z z z z z z z z z z z z i k j j j j j j j j j j j y { z z z z z z z z z z z i k j j j j j j j j j j j y However, we solve eq 18 iteratively with the Davidson algorithm. 74−77 Hence, we need to introduce the so-called σ vector as follows:  (28) where Q runs over auxiliary basis functions and the b pq Q terms are the molecular orbital (MO) basis DF factors, which are defined in our previous studies. 65 With the DF approximation, the EOM-CCSD σ i a equation can be written as With the DF approximation, the EOM-CCSD σ ij ab equation can be written as 3.3. PPL Algorithm with the DF Approach. The most expensive terms of the T 2 and σ 2 amplitude equations are the PPL terms. Our PPL algorithm for the σ 2 tensor originated from the PPL algorithm used for the T 2 amplitude equation in our 2016 study. 65 Here, we employ the same algorithm to σ 2 amplitudes. For example, for the closed-shell case, the PPL term can be written as Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article Following the previous studies of Saebø and Pulay 78 and Scuseria at al. 79 and our previous DF-CCSD studies, 65,66 we employ the following algorithm for the evaluation of σ-PPL: where S is the symmetric component, while A is the antisymmetric component. Now let us define where S and A have the following symmetry properties.
Hence, we can always keep i ≥ j and a ≥ b.
The pseudo code for the σ-PPL algorithm is With this algorithm, the cost of PPL is 1/4O Finally, we note that our DF-EOM-CCSD code has a sharedmemory parallelism feature through threaded BLAS calls as well as OPENMP parallelization of all tensor manipulations. 3.3.1. DF/CD Hybrid PPL Algorithm. In the common CD approach for ERIs, the CD factors are generated from the AO basis ERI tensor (μν|λσ), and the number of CD factors is generally higher than that of DF factors. Hence, it does not seem to speed up our DF algorithm. However, we have observed that, for the large molecules, the CD technique can be beneficial to take advantage of the sparsity of the ERI tensor if it is applied to the MO basis ERIs generated from the DF integrals. More specifically, if we perform the Cholesky decomposition of the (ab|cd)-type integrals, we may get a reduced number of auxiliary basis functions, which is especially true for large molecular systems. Hence, in our DF/CD hybrid approach, we build the (ab|cd)-type integrals from the DF factors, on-the-fly, and perform Cholesky decomposition simultaneously.

RESULTS AND DISCUSSION
Results from the DF-EOM-CCSD and RI-EOM-CCSD 80 methods were obtained for a set of alkanes to compare the computational cost for the excitation energy computations. For the alkanes set, Dunning's correlation-consistent polarized valence triple-ζ basis set (aug-cc-pVTZ) was used with the frozen core approximation. 81, 82 For the aug-cc-pVTZ basis sets, aug-cc-pVTZ-JKFIT 50 and aug-cc-pVTZ-RI 83 auxiliary basis set pairs were employed for reference and correlation energies, respectively. Additionally, the DF-EOM-CCSD, resolution of the identity EOM-CCSD (RI-EOM-CCSD), 80 and EOM-CCSD(fT) 84 methods were applied to a set of molecules to compare the excitation energies.
Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article 4.1. Efficiency of DF-EOM-CCSD. A set of alkanes is considered to assess the efficiency of the RI-EOM-CCSD and DF-EOM-CCSD methods. The RI-EOM-CCSD computations were performed with the Q-CHEM 5.3 package. 72 The computational time for the RI-EOM-CCSD and DF-EOM-CCSD methods are presented graphically in Figures 1 and 2 for restricted and unrestricted Hartree-Fock (RHF and UHF) references, respectively. Timing computations were carried out for a single root with a 10 −7 energy and 10 −7 EOM eigenvalue convergence tolerances on a single node (1 core) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz computer (memory ∼ 64 GB). For the RI-CCSD code of Q-CHEM 5.3, MEM_TOTAL 64000, MEM_STATIC 2000, and CC_MEMORY 51200 options are used. We start our assessment with the RHF versions of the RI-EOM-CCSD and DF-EOM-CCSD implementations. The DF-EOM-CCSD method significantly reduces the computational cost compared to RI-EOM-CCSD, and there is more than a 2-fold reduction in the computational time for DF-EOM-CCSD for the largest member (C 8  The efficiency of our DF-EOM-CCSD method compared to that of RI-EOM-CCSD is attributed to the our more efficient PPL algorithm. For the closed-shell case, the number of flops (NOF) for our DF-CCSD method 65,66 is 1/4O 2 V 4 + 2O 3 V 3 + 1/ 4O 4 V 2 , while that of RI-CCSD 80 was reported to be 5/8O 2 V 4 + 4O 3 V 3 + 27/8O 4 V 2 . When one compares the cost of implementation, our DF-CCSD implementation 65,66 is 2.5 times more efficient than that of RI-CCSD 80 for the PPL contraction term. Further, our implementation is 2 times more efficient compared to that of RI-CCSD for the particle-hole ladder (PHL)terms. Moreover, the cost of VVVV-type integral transformation, on-the-fly of course, is 1/2V 4 N aux in our case, while it appears to be V 4 N aux for RI-CCSD. 80 In fact, the most expensive term is this integral transformation step for large-scale computations with optimized auxiliary basis sets. Hence, our algorithm appears to be 2 times more efficient for this term. Basically, we follow the same algorithm for the PPL term of EOM; the same is also true for RI-EOM-CCSD. Hence, the efficiency of our DF-EOM-CCSD implementation was maintained.
As the second step of our timing assessment, we consider the UHF versions of the RI-EOM-CCSD and DF-EOM-CCSD implementations. The DF-EOM-CCSD method noticeably reduces the computational cost compared with RI-EOM-CCSD ( Figure 2 72 ) and DF-EOM-CCSD methods with the cc-pVTZ basis set. The RHF reference is used for these computations. All computations were performed for a single root with 10 −7 energy and EOM eigenvalue convergence tolerances on a single node (1 core) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz computer (memory ∼ 64 GB). Figure 2. Total, CCSD, and EOM wall time (in min) for computations of excitation energies for the C n H 2n+2 (n = 1−7) set from the RI-EOM-CCSD (from Q-CHEM 72 ) and DF-EOM-CCSD methods with the cc-pVTZ basis set. The UHF reference is used for these computations. All computations were performed for a single root with 10 −7 energy and EOM eigenvalue convergence tolerances on a single node (1 core) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz computer (memory ∼ 64 GB). computational time per UHF-CCSD iteration (t ccsd /n iter ) for C 7 H 16 is 83.7 min (DF-EOM-CCSD) and 97.3 min (RI-EOM-CCSD). Hence, our new DF-EOM-CCSD implementation is 1.2 times faster than the RI-EOM-CCSD code for the average computational cost per UHF-CCSD iteration. Similarly, the number of Davidson iterations for the EOM part are 12 (DF-EOM-CCSD) and 18 (RI-EOM-CCSD). The average computational time per Davidson iteration (t eom /n iter ) for C 7 H 16 is 130.7 min (DF-EOM-CCSD) and 178.9 min (RI-EOM-CCSD). Hence, our new DF-EOM-CCSD implementation is 1.4 times faster than the RI-EOM-CCSD code for the average computational cost per EOM-CCSD iteration. Hence, our new DF-EOM-CCSD implementation maintains its efficiency for the UHF reference.
4.1.1. Assessment of the DF/CD Hybrid PPL Algorithm. As the final step of our assessment for the efficiency of our new implementations, we present benchmark timing results for comparisons of DF and DF/CD hybrid approaches for the evaluation of the PPL terms of the CCSD and EOM parts. The ratios of the number of auxiliary basis functions employed in the PPL term of EOM-CCSD for the DF and DF/CD approaches are presented graphically in Figure 3. Since the most expensive term of the PPL algorithm scales linearly with the number of auxiliary basis functions, lets call it M, the reduction of M may yield significant improvements in the evaluation of the PPL term. For example, for the C 9 H 20 molecule with the cc-pVTZ primary basis set, the M values are 1329 and 1208 for our canonical DF and hybrid DF/CD algorithms, respectively. Hence, the ratio of M DF /M DF/CD is 1.10, which indicates a more than 10% reduction in the number of auxiliary basis functions. For the alkanes set considered, C n H 2n+2 (n = 1−9), we plot the M DF /M DF/CD values with respect to the n values and obtain a linear relation for this fit. The equation and the R 2 value for the linear fit are M DF /M DF/CD = 0.0122n + 0.9883 and R 2 = 0.9939. At first, one should note that as the n value increases the M DF / M DF/CD ratio increases as well. The reason for this correlation is that as molecular size increases the hybrid DF/CD algorithm makes better use of the sparsity of the ERI tensor. Hence, the obtained linear equation indicates that, if we proceed to larger molecules, the hybrid DF/CD algorithm will have a larger impact on the computational time. For example, the M DF / M DF/CD ratio will be approximately 1.23 and 1.60 for the C 20 H 42 and C 50 H 102 molecules, which indicates up to 23% and 60% acceleration in the PPL terms can be achieved.
For the alkanes set, the computational time for the DF-EOM-CCSD and hybrid DF/CD-EOM-CCSD approaches with the CD tolerances of 10 −4 , 10 −3 , and 10 −2 are presented graphically in Figure 4. For the largest member of the test set, C 9 H 20 , the computational times are 2205.9 min (DF-EOM-CCSD), 2186.6 min (DF/CD-EOM-CCSD with tol CD = 10 −4 ), 1326.6 min (DF/CD-EOM-CCSD with tol CD = 10 −3 ), and 1232.0 min (DF/CD-EOM-CCSD with tol CD = 10 −2 ). With tol CD = 10 −4 , the cost of DF/CD-EOM-CCSD is slightly reduced compared with that of DF-EOM-CCSD, while with tol CD = 10 −3 and tol CD = 10 −2 , the cost of DF/CD-EOM-CCSD is reduced 39.9% and 44.1% compared with that of the canonical DF-EOM-CCSD. Hence, our new hybrid approach provides significant improvements in efficiency compared to the that of the canonical DF algorithm. Further, our above discussion suggests that for the larger molecules further improvements may be observed. Hence, our new hybrid DF/CD PPL algorithm appears to be very promising for large-sized chemical systems.
We would like to note why we did not prefer the CD decomposition of the original 4-index ERIs. The number of CD factors generated from the 4-index ERIs are generally much higher than that of the DF factors achieving the same accuracy.   approach significantly reduces the number of auxiliary basis functions.

Accuracy of DF-EOM-CCSD.
In this section, we consider a test set to assess the accuracy of the DF-EOM-CCSD method. Chemical names of the molecules considered are given   When these results and the computational efficiency are considered, the DF/CD-EOM-CCSD approach may be used with such loose CD tolerances. The differences between DF-EOM-CCSD and RI-EOM-CCSD results are in between 0.00 and 0.06 eV for most cases.

CONCLUSIONS
In this study, a new implementation of the density-fitted EOM-CCSD (DF-EOM-CCSD) method has been presented with an enhanced algorithm for the particle−particle ladder (PPL) term, which is the most expensive term of the EOM-CCSD computations. To further improve the evaluation of the PPL term, a hybrid DF/CD algorithm has also been introduced. The computational time of the DF-EOM-CCSD excitation energies has been compared with that of the RI-EOM-CCSD method (from Q-CHEM 5.3 package 72 ).
The DF-EOM-CCSD method significantly reduces the computational cost compared to that of the RI-EOM-CCSD method; there is more than a 2-fold reduction for the C 8 H 18 molecule in a cc-pVTZ basis set with the RHF reference. This cost savings results from the accelerated evaluation of the PPL term. In our RHF based DF-EOM-CCSD method, the number of flops (NOFs) for the PPL contraction term is 2.5 times lower than that of the RI-EOM-CCSD method. Further, the prefactor of the VVVV-type transformation step used in the PPL term has a reduced NOF value by a factor of 2. For the UHF reference, our implementation maintains its better performance and provides a 1.8-fold reduction in the computational cost compared to that of the RI-EOM-CCSD method for the C 7 H 16 molecule. Furthermore, our results show that the suggested hybrid DF/CD algorithm further improves the canonical DF algorithm, and the degree of improvement increases as the molecular size increases. The preliminary results indicate that the new hybrid DF/CD PPL algorithm is very promising for large-sized chemical systems.
Finally, the DF-EOM-CCSD and DF/CD-EOM-CCSD methods are applied to a test set to compare the excitation energies with those from the CIS, RI-EOM-CCSD, and EOM-CCSD(fT) methods. Our results demonstrate that the DF-EOM-CCSD and DF/CD-EOM-CCSD methods (with CD tolerances of 10 −4 −10 −3 ) provide identical results with to those of the RI-EOM-CCSD method. Further, the DF/CD-EOM-CCSD approach yields tolerable errors (0.03 and 0.06 eV) compared with those of the DF-EOM-CCSD method for the test set considered with loose CD tolerances, such as 5 × 10 −3 and 10 −2 .
■ APPENDIX A DF-CCSD 3-Index Intermediates. 1-and 3-index intermediates used for DF-CCSD are given as follows: 65 where τ̃= and b rs Q values are the molecular orbital (MO) DF factors, which are defined in our previous studies. 65 −67 F and Intermediates. Density-fitted F and intermediates are 65 (68) where f pq is the MO basis Fock matrix. We would like to note that in our previous studies 65 where P ± (pq) is defined by = ± ± P pq pq ( ) 1 ( ) (72) and pq ( ) acts to permute the indices p and q. and Intermediates. The and intermediates are defined as follows: 65 ■ ASSOCIATED CONTENT
Optimized geometries of the species; excitation energies with the cc-pVTZ basis set for the test set; CIS excitation energies with the aug-cc-pVTZ basis set for the test set (PDF) Spin-free RHF-DF-EOM-CCSD equations (PDF)